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We consider the problem of assessing goodness of fit of a single Bayesian model to the observed data in 
i__i the inverse problem context. A novel procedure of goodness of fit test is proposed, based on construction 

^-H of reference distributions using the 'inverse' part of the given model. This is motivated by an example 

> 

«ys from palaeoclimatology in which it is of interest to reconstruct past climates using information obtained 

from fossils deposited in lake sediment. Since climate influences species, the model is built in the forward 

sense, that is, fossils are assumed to depend upon climate. The model combines 'modern data' which 

\-ml consists of observed species composition and the corresponding observed climates with 'fossil data'; the 

latter data consisting of fossil species composition deposited in lake sediments for the past thousands of 



X 



years, but the corresponding past climates are unknown. Interest focuses on prediction of unknown past 
climates, which is the inverse part of the model. 

Technically, given a model f(Y | X, 6), where Y is the observed data and X is a set of (non- 
random) covariates, we obtain reference distributions based on the posterior ir(X \ Y), where X must 
be interpreted as the unobserved random vector corresponding to the observed covariates X. Put simply, 
if the posterior distribution ir(X \ Y) gives high density to the observed covariates X, or equivalcntly, 
if the posterior distribution of T(X) gives high density to T(X), where T is any appropriate statistic, 
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then we say that the model fits the data. Otherwise the model in question is not adequate. We provide 
decision-theoretic justification of our proposed approach and discuss other theoretical and computational 
advantages. We demonstrate our methodology with many simulated examples and three complex, high- 
dimensional, realistic palaeoclimate problems, including the motivating palaeoclimate problem. 

Although our proposal is ideally suited for checking model fit in inverse regression problems, we 
indicate that the proposal may be potentially extended for model checking in quite general Baycsian 
problems. However, we do not claim to have solved all issues involved; in fact, our aim in this paper is 
to discuss advantages of, and also to shed light on issues that could be potential future research topics. 
If nothing else, we hope to have been able to make a step forward in the right direction. 

Keywords: Bayesian hierarchical model; Discrepancy measure; Importance Resampling; Loss function; 
P-value, Reference distribution 

1 Introduction 

To quote Gelman et al. (1996), assessing the plausibility of a posited model (or of assumptions in general) 
is always fundamental, especially in Bayesian data analysis. Compared to the vast classical statistical 
literature that attempts to address the question of model assessment, the Bayesian literature is much 
scarce. Gelman et al. (1996) seems to be the first to attempt extension of the essence of the classical 
approach to the Bayesian framework. Their approach is based on computing the posterior distribution of 
the parameters given the data and then to compute a P-value, involving a discrepancy measure, which is a 
function of the data as well as the parameters. Their approach differs from the available classical approaches 
mainly in introducing a discrepancy measure that depends on the parameters as well. Bayarri and Berger 
(2000) introduced two alternative P-values and demonstrated that they are advantageous compared to 
the P-value of Gelman et al. (1996). In this paper, we introduce an approach based on 'inverse reference 
distributions' (IRD). We argue that the approach is best suited for assessing Bayesian model fit in inverse 
problems but may be extended to quite general Bayesian problems. The proposal is novel compared to the 
available approaches and has some distinct advantages. 

The motivating example arises in quantitative palaeoclimate reconstruction where 'modern data' con- 



sisting of multivariate counts of species are available along with the observed climate values. Also available 
are fossil assemblages of the same species, but deposited in lake sediments for past thousands of years. This 
is the fossil species data. However, the past climates corresponding to the fossil species data are unknown, 
and it is of interest to predict the past climates given the modern data and the fossil species data. Roughly, 
the species composition are regarded as functions of climate variables, since in general ecological terms, 
variations in climate drives variations in species, but not vice versa. However, since the interest lies in 
prediction of climate variables, the inverse nature of the problem is clear. The past climates, which must 
be regarded as random variables, may also be interpreted as unobserved covariates. It is thus natural to 
put a prior probability distribution on the unobserved covariates. 

Interestingly, the approach used for prediction of past climates motivates our Bayesian approach to 
assessment of model adequacy, in particular, for inverse regression problems, using posterior distributions 
based on prior probability distributions on covariates, which are treated as unknown. Broadly, we say that 
the model fits the data if the posterior distribution of the random variables corresponding to the covariates 
capture the observed values of the covariates. Otherwise, the model does not fit the data. It is worth 
noting that although the values of the covariates are known, we propose to fit the model assuming that the 
values are unknown and predict the random variables that stand for the unknown covariates. The covariates 
predicted in this manner can then be compared with the originally observed values to assess model fit in 
a fully Bayesian manner. 

The rest of our paper is structured as follows. In Section 2 we review the existing literature on model 
assessment, all of which are concerned with forward problems. The key idea of our IRD approach is 
presented in Section 3, and in Section 4 we provide a decision theoretic justification of our proposed 
approach. In Section 5 we note that improper priors may render the reference posterior improper; in 
this context we suggest a remedy using cross-validation. We provide a summary of our illustrations of 
the IRD approach with various examples in Section 6. Application of our methodology to the motivating 
palaeoclimate problem is discussed in Section 7. Conclusions and future work are discussed in Section 8. 

Further details on methods, experiments and data analyses are provided in the supplement Bhat- 
tacharya (2012), whose sections, figures and tables have the prefix "S-" when referred to in this paper. 



Here we briefly describe the contents of the supplement. Some relevant discrepancy measures for IRD 
are provided in Section S-l of the supplement. Further details regarding prior construction for the IRD 
approach in addition to that presented in Section 5, are provided in Section S-2. In Section S-3 we provide 
a brief overview of Importance Resampling MCMC (IRMCMC) proposed by Bhattacharya and Haslett 
(2007), for cross-validation in inverse problems, which is an indispensable computational method for IRD. 
The complete details of the summary of the illustrations of IRD with simulation examples outlined in Sec- 
tion 6, are presented in Section S-4. In Sections S-5 and S-6 we discuss applications of IRD to extensions 
of the motivating palaeoclimate problem presented in Section 7. 

2 Overview of methods of model assessment in forward problems and 
their limitations 

One approach for checking the fit of a model is by examining the marginal distribution of the data (Box 
(1980)). Specifically, if the marginal density of Y is small, then Y is unlikely under the given model. 
A problem with this approach is that, for improper prior on the parameter 9, the marginal is improper. 
Another problem is to decide on precisely how small the marginal should be so that Y can be treated as 
unlikely to lead to rejection of the model. The approach of reference distributions may be applied to this 
idea, but the problem of impropriety of the marginal for improper priors is an impediment. Cross-validation 
may be used as a proxy for the marginal, but in this case, strictly speaking, data Y would be used twice; 
once to compute the cross-validation posteriors {ir(- \ X,Y—i);i = l,...,n} and again to construct the 
discrepancy measure. 

Gelman et al. (1996) recommended generalised test statistics T(Y,9) that depend on the parameters 
as well as the data, and proposed a Bayesian P-value for assessing goodness of fit. However, their model 
checking strategy uses the data twice: once to compute the observed statistic T(Y, 9) and again to obtain 
the posterior predictive reference distribution. Bayarri and Berger (2000) demonstrated with examples that 
using data twice is undesirable; see also Ghosh et al. (2005). Specifically, in such cases, even with arbitrarily 
strong evidence against the null model, the P-value does not tend to zero. Also, the posterior predictive 



P-values do not generally have a uniform distribution under the null hypothesis, not even asymptotically 
(Bayarri and Berger (2000), Robins et al. (2000)). 

Bayarri and Berger (2000) developed a related approach based on posterior distributions that condition 
on only part of the information in the data rather than using the full posterior distribution to define 
the reference distribution. Robins et al. (2000) showed that their P-values are asymptotically uniformly 
distributed under the null hypothesis. But it is not clear to the author of this paper (see also the discussion 
by Evans of the paper by Bayarri and Berger) if the same is true for a finite sample size. Another important 
point is that, given a specific discrepancy measure computation of their P-value is burdensome and requires 
knowledge of the analytic form of the density of the specific discrepancy measure, which is not available in 
general. Arguments are provided in Bayarri and Berger (2000) that estimation of the density of a particular 
discrepancy measure is not difficult; however, since the above authors did not provide any guidelines how to 
choose the right discrepancy measure, many possible discrepancy measures must be considered. But then 
computation of P-values for each discrepancy measure has to be done afresh; this will certainly become 
computationally very expensive for complex problems. Stern and Cressie (2000) pointed out that the 
approach of Bayarri and Berger (2000) can be very difficult to apply for the kinds of complex models that 
are most challenging to check in practice. Bayarri and Berger (2000) demonstrated their proposal with 
many theoretical examples, but they did not provide assessment of the performance of their methodology 
in the case of complex, real problems. 

3 The key idea of IRD 

The essential idea of constructing an IRD can be described as follows. Suppose that data Y = {yi},i = 
1, . . . , n are available. Also available, suppose, are covariates X = {xi},i = 1, . . . , n. Each of yi or X{ may 
also be multivariate. We assume that there is a probability model associated with Y, given covariates X. 
We also assume, as is natural, that X is not associated with any probability model. So, we treat Y as 
the data, but X as fixed constants. To proceed with our approach we first pretend that the values of the 
covariates are unknown, probabilistically interpreted as random variables, which we denote as X. This 



unknown set of random variables X, which may also be thought of as a replicate of the observed covariates 
X, must be predicted from data Y, in an inverse sense. If the predicted values of X are consistent with 
observed X then we say that the model fits the data adequately, otherwise we say that the model does 
not fit the data. A fully Bayesian approach to this prediction problem requires computation of an inverse 
reference distribution based on the posterior 



tt(X I Y) oc f tt(X, 9)L(Y, X, 9)d9 (1) 



In the above, L denotes the likelihood of the unknowns (X, 6) where 9 is the set of model parameters. It is 
important to observe that the above posterior does not depend upon the observed covariates X; in other 
words, the model is fitted without using the observed covariates. In the above posterior both 6 and X can 
be regarded as unknown parameters. In our approach the model parameter 9 will be regarded as a set of 
nuisance parameters (more discussion to follow subsequently) and X will be regarded as the parameters 
of interest. The prior on X and 9 has been denoted by ir(X,9). We discuss in this paper that, based on 
whether or not observed X is supported by the above posterior, an effective overall goodness-of-fit test, 
which has some desirable properties, can be devised. 

Observe that in our proposal, the data is not used twice, since computation of the posterior (1) involves 
conditioning on Y alone. The observed covariates will be used only for the construction of the discrepancy 
measure. Denoting by T(X) a discrepancy measure involving only observed covariates X, we construct 
the corresponding reference distribution of the random discrepancy measure T(X). Some examples of 
discrepancy measures are provided in Section S-l of the supplement; for applications in this paper we 
throughout use the discrepancy measure (1) of Section S-l, given by 

(xi - En(xi)) 2 



T(X)=T 1 (X) = J2 



V n (xi) 



i=l 

where E n and V v denote the mean and the variance with respect to ir(- \ Y). Then if T(X) lies within 
the appropriate credible region of T(X), the model will be accepted, otherwise it will be rejected. This we 
formalize decision theoretically in Section 4. Thus, unlike other approaches (both Bayesian and classical), 
we have clearly defined a method that can decide whether to accept or to reject the model in question. An 
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important issue to address in this context is whether or not the discrepancy measure should be allowed 
to depend upon the model parameters 6. In the forward context, Gelman et al. (1996) defined general 
discrepancy measures that depend upon both data and the model parameters. However, in our inverse 
approach letting the discrepancy measure depend upon the model parameters will often not be meaningful, 
unless we let the discrepancy measure also depend upon Y. But this would imply using data Y twice; once 
to compute the posterior (1) and again to compute the discrepancy measure. So we strongly recommend 
that discrepancy measures be independent of the model parameters. We discuss this with an example. 
Let us consider a Poisson regression model, which we will use to illustrate our proposal, given by yi ~ 
Poisson(0Xi); i = 1, . . . , n. In the forward context, a discrepancy measure based on the residuals yi — 9xi 
seems natural. However, in our inverse approach an analogous measure is not permissible, since this would 
entail using Y twice, as indicated above. Indeed, one of our aims is to avoid using double use of the data. 
Moreover, in the case of complex hierarchical models there may be thousands of model parameters and 
in such cases it is not clear how to construct a sensible discrepancy measure using such high-dimensional 
model parameter. In our opinion, it makes more sense to integrate out the model parameters and base the 
discrepancy measure solely on the covariates. In other words, we treat the model parameters as nuisance 
parameters in our approach. For details regarding nuisance parameters see Berger et al. (1999). 

In the next section we formalize our proposed approach by providing a decision theoretic justification. 
Based on "0-1" loss function, we also provide a simple, but explicit, formula for accepting or rejecting the 
model in question. 

4 Decision theoretic justification of our proposed IRD approach 

If the data really come from the model assumed, then for any general discrepancy measure T, T(X) is 
expected to give high probability density to the point T(X). In other words, we say that the model does 
not fit the data if for some pre-assigned quantity e, 



T(X) - T(X) 



V n (T(X) | Y) 



> e 



with high posterior probability. Here we remind the reader that X is to be considered a set of random 
variables or unknown parameters corresponding to the true values X. The random and observed discrep- 
ancy measures T(X) and T(X) can likewise be treated as a parameter and the true value of the parameter 
respectively. Using this framework, it is easy to formulate a Bayesian hypothesis testing problem, in the 
spirit of that provided in Berger (1985). Note that we could not do the same if the discrepancy measure 
were dependent on data Y; this is because Y is the data arisen from a probability model and can not 
be interpreted as parameter. Since all other available approaches to model assessment use discrepancy 
measures involving data Y, they do not have the Bayesian decision theoretic framework. 
To put it formally, we are interested in testing 

T(X) - T(X) 



against 



H 



Hi 



V V (T(X) | Y) 

T(X) - T(X) 



< e 



> e 



X(T(X) | Y) 

For k = 0,1, let Tk denote the parameter space of T(X) implied by H k - We denote acceptance of H k by a k 

and consider the "0-1" loss function C(T(X), a k ) = OifT(X) G T k and C(T(X), a k ) = 1 ifT(X) eTfJ^k. 

Then Bayes action (see Berger (1985) for definition) is simply that for which the posterior expected 

loss E n r.\Y){C(T(X), a k )}', k = 0, 1 is the smaller, which implies that the Bayes decision is simply the 



hypothesis with larger posterior probability. If 



P = 7T 



T(X) - T(X) 



< e I Y 



(3) 



'V n (T(X) | Y) 

then Hq is to be accepted if p > 1 —p, i.e. p > 1/2. It is important to note that the posterior probability is 
not a P-value, nor is it related to any P-value of any kind. It is simply a posterior probability of a unknown 
parameter. Also, very clearly, the data is not used twice in computing the posterior probability. Due to 
this reason the probability given by (3) has appropriate behaviour under the null hypothesis; in fact, very 
clearly, the posterior probability has uniform distribution for any size of the data. It is useful to briefly 



clarify in this context the issue of double- use of the data and the consequences. The posterior predictive 
P-value of Gelman et al. (1996) is defined as 



P P ost{Y) = f Pr(T(Y,ff) > T{Y,9) | 6)ir(6 \ Y)d6 



The (frequentist) distribution of P pos t(Y) depends upon the distribution of the conditioned "data" Y. 
Since, Y is used twice, the P-value is "overconfident" and the distribution is not even asymptotically 
Uniform(0, 1). Now consider the posterior probability that corresponds to the IRD approach, as below: 

Pird(Y) = Pr{T(X) > T(X) \ Y) 

The (frequentist) distribution of Pird(Y) again depends upon "data" Y, but does not vary with the 
observed covariates X since the latter will always be treated as fixed. But Y is used only once even if we 
use cross-validation posteriors 7t(xj | X-i,Y) to obtain a modification of Pird(Y) (this will be discussed 
subsequently). Hence, Pird(Y), as well as the modfication using cross-validation, are always distributed 
as Uniform(0, 1), irrespective of the sample size! Thus, the decision theoretic framework formally shows 
that our proposed approach is fully Bayesian with a solid theoretical justification. Other available model 
checking methods, which are all based on the forward part do not have appropriate calibration properties, 
at least if the size of the data set is finite. Sellke et al. (2001) attempt to provide calibration of P- values, but 
that is a "lower bound" calibration which may be too low, especially for larger sample sizes (see the rejoinder 
of Bayarri and Berger (2000)). In the case of forward problems, Hjort et al. (2006) proposed a method of 
calibration, but the method seems to work only if the prior distribution of the model parameters is proper. 
Hence, although the proposal of Hjort et al. (2006) is promising, given that improper prior distribution is 
very widely used, it is also useful to seek alternative criteria. 

The choice of e may be subjective, differing from problem to problem. However, under the true model, 
we would expect the predicted values of X (which may be posterior means, medians, modes, etc.) to be 
close to observed X. Thus, under the true model, T(X) ~ and we would expect 



T(X) - T(X) 



V n (T(X) | Y) 



\T(X)\ 



V«(T(X) | Y) 



under Hq. Hence percentiles of the random variable on the right hand side of (4) may be reasonable 
choices of e. In other words, as a rule of thumb, for a particular choice of a £ (0, 1), e may be regarded 
as the (1 — a)-th percentile of , ' v _ ^=. In this paper we illustrate the power of our test by observing 

V ' y/V*(T(X)\Y) 

if the observed discrepancy measure T(X) falls within the relevant 100(1 — a) = 97% credible region of 
the above reference distribution. We have, however, experimented with several other sizes of the credible 
region corresponding to T(X), but our main conclusions remained unchanged. 

5 Impropriety of the reference distribution and remedy using cross- 
validation 

Note that the integrand of (1) involves the model parameters 6 as well as X. Thus there will be more 
random variables than the number of data points if each of X{ and yi are of same dimensionality. In this 
case, improper prior on any unknown, X ox 6 (the prior on X, if empirically estimated from X, may be 
proper, but the prior on 8 will often be taken as improper in complex hierarchical Bayesian problems), 
will make the posterior distribution ir(X, 6 \ Y) improper if the data fails to provide information on that 
unknown. Using the already introduced Poisson regression model, we illustrate the issue of impropriety of 
the joint posterior if 6 has improper prior. 

Assume yi ~ Poisson(0Xi);i = 1, . . . , n where there are n data points but n + 1 unknowns in X and 9. 
Let us consider a proper prior for X, given by 

n 

vr(A > )«[]exp(-/3x J )xr 1 (5) 

As for 0, suppose we use a uniform improper prior tt(6) = 1 for all 6. Then, the joint posterior of X and 
#is 

!n \ n 

-(0 + /3)^f i l^"=i K ]^[xf +a " 1 (6) 

i=l ) i=l 

The marginal posterior of 9 is given by 

ir(6 I Y) oc -j— -r (7) 

10 



However, 

-d0 = / z an - 2 {l-z)^=^dz (8) 



Now observe that, 



o (9 + P)(v^) J 

and the above integration converges if and only if n > —. In other words, if the prior on X is vague, which 
is signified by a ~ 0, (3 ~ 0, then the data size n has to be impractically large to render the posterior of 9 
proper. Hence, in general, for this problem, 

f ir(9 | Y)d6 = oo (9) 

ir(X, 9 | Y)dXd9 = f j f ir(X \ Y, 9)dX 1 n(6 \ Y)d9 (10) 

Note, that in (10), for each 9, tt(X \ Y, 9) is proper; in fact, just a product of proper Gamma densities (can 
easily be seen from (6)), so J ir(X \ Y,9)dX = 1. However, since the term J n(9 \ Y)d9 = oo by (9), the 
above integral (10) is infinity as well. Hence, the joint posterior n(X, 9 \ Y) is improper. 

We note, however, that it is not necessarily the case that the distribution of T(X) will be improper if the 
posterior ir(X \ Y) is improper (to consider a pedagogical example, if 7r(V>) = l;ip €. (0,oo) then exp(— ip) 
has a proper distribution on (0, 1)). However, in general, the distribution of T{X) will be analytically 
intractable, and it must be obtained using MCMC simulations of (X, 9) from the joint posterior ir(X, 9 \ Y) 
(since the marginal posterior distribution ir(X \ Y) is analytically intractable). Now, if the joint posterior 
tt(X, 9 | Y) is improper, then MCMC simulations from this posterior will not make sense. Hence, it is very 
important that the joint posterior is proper. 

To avoid the problem of impropriety, we propose to approximate the true posterior distribution us- 
ing cross-validation. In other words, we propose to simulate from the leave-one-out posteriors {vt(xj | 
X-i,Y);i = l,...,n}, where X-i stands for the data, omitting in each case the corresponding x%. The 
random variable x« corresponds to the omitted value Xj. Note that the leave-one-out posterior with case i 
omitted is given by 

7r(xi | X-i,Y) oc f ir(xi,e)f( yi | x i ,9)\[f{y j \ Xj ,6)d9 (11) 
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In the above integrand, X{ and 9 are the only random variables, whereas Y is the data and Xj]j 7^ i are 
known constants. Hence there are much less unknowns compared to the number of knowns; this usually 
results in a proper posterior of Xj and 9. With the above Poisson regression example, but with improper 
priors on both X and 9, that is, 7t(xj) = 1; Xi > for i = 1, . . . , n and tt(9) = 1; # > 0, it can be shown 
that the cross-validation posterior of 9, with Xj deleted, is given by 

<0 I X^Y) = l y;^ J , e&'# w 1 ) ex P -9 J2 *j , (12) 

which is a Gamma distribution. The cross-validation posterior of Xj, when Xj is deleted, is given by 



E^^j r(£. =1% + ij 



x w 



7T(Xi I X_i,y) = — . „ r 7^ — V (13) 

r(s/i + i)r(E^i%-) (x + E- ^) 

Clearly, both the marginal cross-validation posteriors (12) and (13) are proper, although the priors of 9 
and Xi are improper. Certainly, unlike in the case of (10), the joint posterior ir(xi,9 \ X_i,Y) is proper. 
Hence, MCMC simulation from 7t(xj, 9 | -X_j, 1") is not problematic, even though the priors of both Xj and 
9 are improper. 

Observe that the set of leave-one-out posteriors {7t(xj | -X_$, 1"); i = 1, . . . , n} is equivalent to 7r(xi, . . . , x n 
Y). To see this, let Xq = (xio, • • • , x n o)' be any fixed point in the support of 7r(xi, . . . ,x n \ Y). Then it 
holds that 



7r(xi,...,x n I Y) 



7r(xi I X 2 , . ■ ■ , Xn, Y) 7r(x 2 | XjQ, X 3 . ■ ■ , X ra , F) 
7r(xio I X 2 , • • • , X„, Y) 7t(x 2 I #10, X 3 , . . . , X n , Y) 

— i --vr(xio,...,x„ I y) (14) 

■K{x n0 I xio,. • .,x n -io,Y) 



This follows from Brook's lemma (Brook (1964)), the usefulness of which is exposed in Besag (1974). 
Equation (14) expresses the joint distribution 7r(xi, . . . ,x n | Y) in terms of the full conditional distributions. 
Note that the factor 7r(xio, • • • , x„o | Y) appearing on the right-hand side of (14) is just a constant. Hence 
the joint distribution is determined by the full conditional distributions up to a proportionality constant. 
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So, under the true model a draw from each of ir(- | -X~_j, Y); i = 1, . . . , n, will approximate n(x\, . . . ,x n \ 
Y). In particular, even if the true posterior distribution 7r(xi, . . . ,x n \ Y) is improper, the approximating 
posterior distribution induced by the cross-validation posteriors, will be proper (see also Gelfand (1996), 
Carlin and Louis (2000)). In fact, in such case, {xi, . . . ,x n } can be looked upon as just a realisation from 
7r(- J Y). One might argue that strictly speaking, X is being used twice; once to compute the posteriors and 
again to construct the discrepancy statistic. However, we had argued earlier that X may not be treated 
as the data since there is no probability model associated with it. Even so, if (X, Y) is considered the 
entire data set, then since Y is not used to compute the discrepancy measure, the entire data set is not 
used twice in our implementation. In problems where there are no impropriety issues, our experiments 
revealed (not reported in this paper) that the results obtained by directly computing (1) are equivalent 
to the results obtained by implementing this cross-validation idea. Since in a very large class of Bayesian 
models the impropriety problem will arise, for the sake of generality we recommend this cross-validation 
idea for implementation of model adequacy test. Moreover, cross-validation has a nice intuitive appeal, 
and can provide insight into finer aspects of the data in addition to providing an overall goodness of fit 
statistic. For instance, it can be checked if any individual Xi is an outlier with respect to the Bayesian 
model; for details, see Section S-l of the supplement. Also, by noting the number of observed Xi falling 
within the respective credible regions it is possible to obtain more information about model fit issues. This 
is exactly the procedure we use for gaining insight into model fit issues of the motivating palaoclimate 
example; see Sections 7, S-4, S-5 and S-6 for details. 

For sufficiently large data sets, obtaining samples from all the leave-one-out posteriors {ir(- | -X_j, Y); i = 
1, . . . , n} seems to be a daunting task. However, the IRMCMC methodology of Bhattacharya and Haslett 
(2007) (see Section S-3 for an overview) can be employed to generate samples from the inverse cross- 
validation posteriors in a very fast and efficient manner. Once samples from the leave-one-out posteriors 
are obtained, distributions of any discrepancy measures can be trivially obtained using the samples. This is 
in sharp contrast with the methodology of Bayarri and Berger (2000) (albeit their methodology is developed 
keeping the forward context in mind), since their proposal requires re-computation for each discrepancy 
measure and hence is computationally burdensome. 
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We finish this section by summarising the important differences between our approach and the ap- 
proaches of Gelman et al. (1996) and Bayarri and Berger (2000) in Table 1. 

6 Summary of simulation studies illustrating our IRD approach 

In the supplement we consider five examples to illustrate our approach. Due to issues related to space here 
we only consider a summary of the simulation studies and refer to the supplement for the details. 

In Examples 1 and 2 we assume that given x±, ■ ■ ■ , xio, which are drawn randomly from Uniform(l, 2), 
the data y±, . . . ,y\o come from Geometric(pi) , where pi = 1/(1 + 9xi). We further assume that the data 
has been modeled as Poisson(9xi). A uniform improper prior has been put on 9, that is, tt(9) = 1; 6 > 0. 
In this case the two models are expected to agree closely when 9 is small but increasing disagreement is 
expected for increasingly large values of 6. In Example 1 we consider the forward approach, where instead 
of constructing a reference distribution of T(X) we consider a reference distribution for T(Y). Here Y is 
defined analogously as X. The forward approach is then compared and contrasted with our IRD approach 
of Example 2, where we assume Uniform(l, 2) prior on xf,i = 1, ... , 10. The results of both the examples 
yielded the results expected — that for small 9, the incorrect Poisson model has high chance of being 
accepted and high chance of rejection for high values of 9. Interestingly, the forward approach displayed 
slightly greater power compared to our IRD approach. This is to be expected since the forward approach 
only requires the probability model of Y (which is the same as that of Y) for computing ir(Y \ X, Y), and 
the probability model of Y is stronger than our weak prior assumption on X that we considered in the 
inverse approach. 

The undesirable relatively lower power of the IRD approach in the first two examples prompted further 
investigation. That the power of our IRD approach can indeed be improved with more informative priors 
on X is the issue we demonstrate in Example 3. In this example we assume that for i = 1, ■ ■ ■ , 10, data 
Hi come from the true model Poisson(9xi). The elements of X = {xi;i = !,••• , 10} are drawn randomly 
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from an exponential distribution with mean A; this imples that the true prior for X is given by 

10 

i=i 

where n(xi) are iid exponential with mean A. The parameter 9 is selected randomly from the interval 
(0, 1). Since the prior tends to be more and more flat for increasing A, in Example 3 we investigated if 
greater power results for an assumed non-informative prior on X for large values of A as opposed to smaller 
values of A, assuming that the underlying Poisson model is known. The results of Example 3 confirm our 
anticipation. 

In Example 4 we consider a variable selection problem, assuming the true model to be Poisson with 
mean 9 = 9\Xi + 9ix\. We then assess which of the three cases: (a) 9 = 9±xf, (b) 9 = 9\X{ + 9^x1 and (c) 
9 = 9\Xi + 9<ix\ + 9^xf, is appropriate. Our IRD approach correctly identified the true model (b) most of 
the time. 

In Example 5 we attempt to clarify the phenomenon of overfitting and that it can be detected by our 
IRD approach. Here, given 9 and xf, i = 1, • ■ ■ ,10 (drawn from a uniform distribution), we asusme that 
Hi ~ Poisson(9xi), but suppose that y» has been modeled as a Geometric distribution with parameter 
Pi = 1/(1 + 9xi). Here although the expected value of m under both the models is the same, the variance 
under the Geometric model is greater than in the Poisson case. Thus, for certain values of 9 the Geometric 
model may overfit the data which actually comes from the Poisson model, and the discrepancy measure in 
such a case may turn out to be too small, which would lead to acceptance of the Geometric model unless 
our approach based on reference distribution is used. We present such a case with 9 = 15, where the 
discrepancy measure is small, apparently suggesting acceptance of the Geometric model. But with respect 
to the reference distribution this measure is too small to lead to acceptance of the wrong Geometric model, 
thus demonstrating a very desirable feature of our IRD approach. 

We next consider application of our methodology to the motivating palaeoclimate example. 
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7 Application of IRD approach to the motivating palaeoclimate exam- 
ple 

Vasko et al. (2000) reported a regular MCMC cross-validation exercise for a data set comprising multivariate 
counts yi on m = 52 species of chironomid at n = 62 lakes (sites) in Finland. The unidimensional Xi denote 
mean July air temperature. As species respond differently to summer temperature, the variation in the 
composition provides the analyst with information on summer temperatures. This information is exploited 
to reconstruct past climates from count data derived from fossils in the lake sediment; see Korhola et al. 
(2002). 

The cross-validation exercise was computationally challenging, requiring 62 separate regular MCMC 
exercises and involved a parameter 8 of dimension 3318. However, implementation of cross-validation by 
regular MCMC is not infeasible in this case. But the problem seems to be an ideal real life problem where 
the performance of IRMCMC can be tested by making comparison with regular MCMC and complete 
details regarding this can be found in Bhattacharya (2004). 

In the case of Vasko et al. (2000), our MCMC implementation took 16 hours. In contrast, the IRMCMC 
implementation took 16 minutes for the initial run and 20 minutes for the remaining 61. Additionally, 
IRMCMC drew attention to the bimodality of one of the posteriors, a point completely missed by the 
MCMC implementation. For details, see Bhattacharya and Haslett (2007). To proceed with the goodness 
of fit test, we first provide description of the underlying model. 

7.1 Model description 

In Vasko et al. (2000), the vector yi of counts at site i followed the multinomial distribution, 

{yi | y i+ , p^ ~ Multinomial(y i+1 p;). (15) 
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Here yi = (yn, • • • , y im ), y i+ = YZ=\ Vik an d P« is an (unobserved) vector of relative abundances (pn, ■ ■ ■ ,Pi m ), 
of dimensionality (m — 1) = 51. We denote the multinomial likelihood as 

52 
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(16) 



The unobserved {p«; i = 1, ■ ■ ■ , n}, thus provide 62 x 51 parameters, even before temperature Xi is related 
to the relative abundances. Vasko et al. (2000) related these via a Dirichlet model, 



( Pi | Xi, *i,--- ,* 



52; 



Dirichlet(Ai 



(17) 



where the fcth component Ajfc of Aj was modelled as 

Ajfc = A(xj, ^fe), for a simple function A of x« and of ^> k = (o^, /3fc, 7fc), a 3-component parameter vector 
associated with the A;th species. Vasko et al. (2000) chose a simple unimodal "response function" of these 
species specific parameters, given by 



\(xi,W k ) = "fcexp 



Ik 



(18) 



The mode, f3 k , represents the value of temperature at which the species k is most abundant. Tolerance of 
the species is denoted by 7^ and a k is a scaling factor. There are thus an additional 3 x 52 parameters, 
yielding 3318 in total. We write 6 = {pi, • • • , P62, *i, ■ ■ • , ^52}- 

As for the priors, Vasko et al. (2000) assume that a k ~ Uniform(0.1, 50), (3 k ~ Normal(ll.l9, 1.57 2 ), 
7fc ~ Gamma(9, 3) (that is, a Gamma distribution with mean 3 and variance 1) and Xi ~ A^orma/(11.19, l.ll 2 ). 



7.2 Results of assessment of model fit using the IRD approach 

Observed T\(X) and the posterior distribution of T\(X) are shown in Figure 1. Note that T\{X) is located 
far from the mode of T\(X), indicating that the model does not fit the data. In fact, an application of the 
formal Bayesian hypothesis testing procedure gives, for any sensible choice of e, 



p = IT 



Ti(X)-Ti(V) 



V^T^X) I Y) 



< e I Y 



0. 
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This is a consequence of the fact that many observed temperature values are far from the modes of the 
respective posterior density; see Bhattacharya (2004). In fact, it has been found that more than 40% of the 
observed data lie outside the 95% highest posterior density credible regions, suggesting poor fit of the model 
to the data. We anticipated that the reason for this lack of fit is that the assumed unimodal model used to 
describe Aj& in (18) is inappropriate. Indeed, it has been argued in the palaeoclimate literature that species 
can have multiple climate preferences, in which case the unimodal model is inappropriate. Bhattacharya 
(2006) used another modelling approach where, rather than unimodal functions, the response functions A,fc 
were modelled as mixtures of normal densities; the number of components being unknown. He viewed the 
parameters associated with each component of the mixture as samples arisen from the Dirichlet process 
(see, for example, Ferguson (1974), Ferguson (1983), Escobar and West (1995)). This way of modelling 
automatically induces a prior on the number of components; see Antoniak (1974). This approach to 
modelling the response surfaces improved the model fit, although it is yet to be completely satisfactory. In 
Section S-5 we describe this in detail. 

8 Conclusions 

The IRD approach is simple and we have attempted to provide clear cut guidelines when to accept or 
reject the model in question. It also seems to have very general applicability. A key point of our proposal 
is that it does not recommend acceptance or rejection of a model by noting the magnitude of an observed 
discrepancy measure alone. 

One important point to note is that there are no parameterisation problems in our approach with 
reference distributions since all parameters other than X are integrated out. No asymptotic theory, of any 
sort, is needed to make this approach work. Importantly, the data is not used twice and P-values have 
been replaced with Bayesian credible regions. In particular, the latter point makes our approach "more 
Bayesian" compared to the other available approaches. 

Simulation from the cross- validaton posteriors tt(x | -X"_j, Y) for each i, needed to compute the reference 
distribution corresponding discrepancy measure, appears very demanding at the first sight, particularly if 
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there are a large number of cases. However, IRMCMC is a method that seems to be highly suitable for 
computing the pairs cheaply and efficiently. We recommend IRMCMC for the computational needs of this 
model assessment proposal. 

All said, however, there is certainly scope for further investigations. In fact, our aims regarding this 
paper is quite modest — to indicate potential advantages as well as to shed light on issues involved in our 
proposal that need future attention. For instance, the issue regarding the prior on X deserves more careful 
attention. We believe, that for appropriate informative prior it is possible to overcome the slight deficiency 
of the power that our approach seems to currently exhibit. One important issue that we ducked in this 
paper concerns questions regarding appropriate discrepancy measures. It will be interesting to address 
optimality properties of discrepancy measures. There may also be questions regarding the reliability of 
the IRD appraoch if the the posterior distribution of x%, for some, or all i, are multimodal. However, in 
such cases, there exist choices of T for which the posterior distribution of T(X) will be unimodal; see, for 
example, Baker (1930). In the assessment of Vasko's model, one Xi became bimodal, and in our model in 
Haslett et al. (2006), most x\\ 1 = 1, . . . , 7815 were multimodal; for details, see Bhattacharya and Haslett 
(2004), Bhattacharya (2004). But the discrepancy measure T\{X) was unimodal in all cases. 

Another topic for future research is to systematically address the question of the efficiency of our 
proposal when the number of covariates is allowed to be very large. Although the general methodology 
presented in this paper will certainly remain valid for multi-dimensional covariates, we anticipate that 
it may be slightly difficult to devise appropriate overall discrepancy measures of goodness of fit. It is 
worthwhile to note in this connection that Bhattacharya and Haslett (2004) addressed goodness of fit of 
the complicated palaeoclimate model of Haslett et al. (2006), where there are two covariates instead of one; 
see also Bhattacharya (2004) for more detail. In the above-mentioned research two separate discrepancy 
measures were constructed instead of a single overall measure of fit. The final conclusions regarding 
goodness of fit of the model, however, were consistent with respect to the two independent discrepancy 
measures. We also remind the reader that our proposal is ideally suited for model assessment in inverse 
regression problems. However, this seems to have quite good potential in assessing Bayesian model fit in 
general. We look forward to providing a detailed separate paper on issues regarding application of our 
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methodology to problems other than inverse regression. 
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Table 1 : Comparison of reference distribution approaches 



IRD 


Bayarri and Berger 


Gelman et al. 


Fully Bayesian approach 


Not fully Bayesian 


Not fully Bayesian 


Uses tt(X 1 Y) 
as reference distribution 


Uses a modified version of ir(Y X, Y) 
as reference distribution 


Usesyr(y | X,Y) 
as reference distribution 


Measure independent of Y 


Depends on Y 


Depends on Y 


Measure independent of 9 


May depend on 8 


May depend on 6 


Avoids double use of data 


Asymptotically avoids double use 


Uses data twice 


Uses credible sets, 
not P-values 


Uses P-values 


May use credible sets 
but directly related to P-values 


Has calibration property 


Asymptotically has calibration property 


No calibration property 


Computation easy 


Computation hard 


Computation easy 



Vasko, K., Toivonen, H. T., and Korhola, A. (2000). A Bayesian multinomial Gaussian response model for 
organism-based environmental reconstruction. Journal of Paleolimnology, 24, 243-250. 
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Figure 1: Distribution of D\ ar ; the vertical line is the observed value, D° 
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Supplement to "A Fully Bayesian Approach to Assessment of Model 

Adequacy in Inverse Problems" 

j^ Sourabh Bhattacharya* 

^ February 26, 2013 

§ 

j S-l Discrepancy measures for model assessment in inverse problems 

i__i Using an inverse cross-validation approach, we first simulate, for each i = 1, • ■ ■ , n, N realisations from the 

y—i distribution ir(xi \ X-i, Y). Let the simulated values be denoted by {x\ , • • • , x\ }. The simulation can 

> 

CO be carried out very efficiently by using a methodology proposed by Bhattacharya and Haslett (2007); we 

o 

■^" discuss this briefly in Section S-3. Some examples of T(X) are as follows: 
(N 

O Tl{X) = ^ ^7^ V 

^ :/;>;:\:) y-^-=^n ( ; 2 ) 

X 

T A (X) = x t (4) 

We make no argument on the merits and demerits of the above discrepancy measures. However, note 
that while measures T±(X), T2(X) and T%(X) provide summaries of distances between the observed values 




T 3 (X) = max <J ^ =^» }> (3) 



*Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata-700108. Corresponding 
e-mail: sourabh@isical.ac.in. 



Xi and the corresponding summaries of the leave-one-out posteriors ir(- \ X-j,Y), the measure T^{X) is 
just the observed value for case i and thus is different from all other measures in the sense that it is not an 
overall measure of fit. Rather it provides insight specifically into the case i. For example, it can be used 
to check whether or not Xi is an outlier with respect to the underlying model. In this context we note that 
there may exist measures corresponding to which no reference distribution may be easily available. For 
instance, a measure T§(X) may be defined as the number of Xi that fall within the 100(1 — a)% credible 
region of the corresponding leave-one-out posteriors. In this case there seems to exist no easily computable 
reference distribution. 

S-2 Discussion regarding priors on (X,9) 

It is important to have some discussion on the choice of priors on X and 9. Since 8 is a set of model 
parameters, the issue of choice of the prior on 6 is generic. In the absence of any information, which 
is usually the case, it is natural to put a somewhat vague prior (usually non-informative) on the model 
parameters 6, hoping that the true value is supported by the prior. In our illustrations, we use non- 
informative (improper) priors on 6. 

The issues regarding the prior on X are more interesting. Note that the true values X are known, so it 
is tempting to put an overly strong prior on X which assigns all mass to the true values X. However, this 
choice of prior is certainly inappropriate for assessing model fit, since irrespective of the suitability of the 
model to the data, the posterior of T{X) will be a point probability mass at T(X), thus reflecting only the 
prior aspect. For proper model checking it is necessary to make the prior parameter space of X as large as 
possible, so that all possible values of the covariates are explored. One can then observe, whether or not 
the observed covariates get high density a posteriori. 

The preceding discussion seems to suggest a non-informative prior for X. However, if prior information 
of the covariates is available, then there is no reason not to use the information to construct an appropriate 
prior for X. In fact, Example 3 of Section S-4 demonstrates that when prior information about X is 
available, then it is less efficient to put a non-informative prior on X. In the palaeoclimate study reported 



in Haslett et al. (2006) prior information on the past climates were available, which were used to reconstruct 
past climates from fossil pollen data. In the palaeoclimate example in Section 7 of Bhattacharya (2012) we 
use available prior information on the unknown covariates to implement our proposed model assessment 
idea. In that problem it is assumed that the components of X are a priori iid. We remark here that the 
priors for both modern and past climates are obtained from experts before observing the data (modern or 
fossil). When the past is not too far from the present, one can use the same prior for both modern and 
past climates. For general problems, however, such prior information will not be available. In such cases 
one possibility is to estimate some features of the prior distribution using empirical Bayes analysis. In fact, 
the latter procedure, which uses data to reliably estimate features of the prior distribution, has received 
wide attention in the Bayesian statistical literature. For details on this procedure see Berger (1985); see 
also Carlin and Louis (2000). Going by the principles of empirical Bayes methods, it is not unreasonable to 
estimate at least some features, say, moments of the prior distribution on X based on observed covariates 
X. It is important to remind the reader in this context that strictly speaking, X is not the data, since 
unlike in the case of Y, there is no probability model associated with X. Only Y, which has a probability 
model, given the covariates X, should be strictly regarded as the data. Hence we argue that estimation of 
some features of the prior on X using observed X is a reliable procedure; it is neither non-Bayesian, nor 
does it lead to double use of the data. As an aside, and as a possible topic for future research, we note that 
it may also be advisable to check robustness of the results on model assessment with respect to several 
plausible priors on X, including the one obtained by empirical Bayes analysis. In the palaeoclimate study 
reported in Haslett et al. (2006) prior information on the past climates were available, which was used to 
reconstruct past climates from fossil pollen data. However, Haslett et al. (2006) also performed limited 
sensitivity analysis; for complete details, see Bhattacharya (2007). In this research, however, we do not 
discuss sensitivity analysis. 



S-3 Computation of inverse leave-one-out posteriors 

Sampling from the cross-validation posteriors 7t(xj | X-i,Y);i = l,...,n seems to be very demanding 
at the first glance, since, in principle, n many computer-intrensive runs of regular MCMC, which we call 
n-fold regular MCMC, are necessary. Bhattacharya and Haslett (2007) show that the approach proposed 
by Gelfand et al. (1992), Gelfand (1996) which is based on importance sampling (see, for example, Geweke 
(1989)) in the context of forward problems is inapplicable to inverse problems. However, a novel method- 
ology proposed by Bhattacharya and Haslett (2007) seems to be very promising in this regard. The above 
authors refer to the methodology as Importance Resampling MCMC (IRMCMC). The key idea is to leave 
out case i*, to sample by regular MCMC realizations of (xi*,9), given X^i*,Y, and to draw a subsample 
of 9 values using appropriately constructed importance weights. 

Given each re-sampled 6, MCMC may be used to realise Xj from the conditional distribution of ir(- \ 
Hi, 9). In particular, MCMC needs to be carefully implemented once, to a selected case i*, generating 
realisations of (xi*,9). For all cases other than i* the resultant sample of 9 values may be re- used using 
importance resampling (IR) (see, for example, Rubin (1988)). In fact, the proposal of Bhattacharya 
and Haslett (2007) is equivalent to resampling both (xi,6) using importance resampling but subsequently 
ignoring Xj, retaining 6 only. Critically, for each such 9, sampling would be done from the low-dimensional 
7r(- J yi,9), for constant 9, typically by MCMC. The latter exercise is very fast. Choice of i* has been 
discussed in detail by Bhattacharya and Haslett (2007); in particular, they show that it is easy to choose 
i* appropriately. 

The proposed procedure of Bhattacharya and Haslett (2007) can be stated in the following manner. 

1. Choose an initial case i* . Use ir(xi,9 \ X-i*,Y) as the importance sampling density. 

2. From this density, sample values {xS',0^ ');£ = 1, • • • ,N, for large N. Typically, regular MCMC will 
be used for sampling. 

3. For ie {l,--- ,i*-l,i* + l,--- ,n) do 

a. For each sample value (x^',6^>), compute importance weights io-» \ = Wi*^{x^\6^>), where the 



importance weight function is given by 

7r(x i .,fl|jr_ i ,y) L(y,x_,,x„g) _ /( yi . | x t *,e)f( yi \ x t *,e) 
Wl * AXu ' *(xi.,e\ x^,y) °" L{Y,x^,xi,e) "" f( yi . | xi.,e)f(yi \ Xi ,e)' u 

In the above, L(Y, X, 9) is the likelihood of the observed data under the model. 

b. For k£ {I,--- ,K} 

(i) Sample 9^ k > from O^ 1 ', • • • , 9^ N > where the probability of sampling 9^> is proportional to w/L ',. 

(ii) For fixed 6 = 9^ k >, draw M times from ir(xi \ yi,9^ k >). Note that in general it is not easy to 
sample from 7t(xj | yi, 9^ k '), even if xi is univariate, and we recommend MCMC for generality. 

c. Store the K x M draws of x\ as x\ , . . . , x\ 

The key idea in the above proposal is the use of 7r(5j, 6 \ X_j» , Y) as the importance sampling density, for 
some particular i*. Bhattacharya and Haslett (2007) demonstrate that it is easy to choose an appropriate 
i*. 

It is shown by Bhattacharya and Haslett (2007) that IRMCMC is MCMC with a special proposal 
kernel. They demonstrate that compared to n-fold regular MCMC, IRMCMC is many times faster than 
regular MCMC and mixes as least as good as regular MCMC. That IR yields reliable approximation in 
this cross-validation proposal is clear, since IR is used only to sample 9 and the importance sampling 
density tt(9 \ X-i*,Y) is a good approximation to ir{9 \ X-i,Y) for any i. It is important to note that 
the posterior 7t(xj* | X_i*,Y) is generally not a good approximation for 7t(xj | X-i,Y); in fact, they 
usually have disjoint supports. So, very reasonably, we have avoided IR to sample from the required 
ir(xi | X-i,Y), using instead resampled values of 8 to sample, via regular MCMC, from 7t(xj | yi,9). 
An important technical question is whether IR whould be used with or without replacement. Although 
most of the references to IR in the literature recommend IR with replacement (see, for example, Gelfand 
et al. (1992), Newton and Raftery (1994), O'Hagan and Forster (2004)), Gelman et al. (1995), Stern and 
Cressie (2000) recommend IR without replacement. They argue that sampling without replacement can 
provide protection against highly variable importance weights. In fact, Skare et al. (2003) formally prove 



a theorem that with respect to the total variation norm, IR without replacement is better than IR with 
replacement. Hence Bhattacharya and Haslett (2007) recommend IR without replacement. Bhattacharya 
(2004) provides further details in this context including a comparison of IR with/without replacement. 

S-4 Illustration of inverse model assessment with the reference distri- 
bution approach 

It has been argued that the reference distribution approach in inverse problems, which is the main con- 
tribution in this work, has some desirable properties and that the computational challenge involved may 
be overcome by IRMCMC. We now illustrate the approach by applying it on various problems involving 
repeated computer-simulated data and mainly noting the percentage of times it gives the correct answer. 
However, we acknowledge that since we obtain only point estimates of true percentages, our evaluation 
procedure may not be completely adequate. 

In the following illustrations we emphasize that experimental evaluation sheds more light on the par- 
ticular choice of discrepancy measure. Even on any particular choice of discrepancy measure experimental 
replications can shed limited light. But since here we are concerned with simulation studies, where the 
true models and their properties are completely known, we may suppose that the point estimates provide 
useful evidence on the general performance of our approach based on reference distributions. Besides, we 
provide other relevant experimental details to supplement the inadequacy of the point estimates. 

In none of our examples do we claim optimality of any particular discrepancy measure. Throughout 
all illustrations the results based on the discrepancy measure T\(X) and the corresponding reference dis- 
tribution T±(X) will be presented. In the examples, we consider that the model fits the data if Ti(X) falls 
within approximately 97% credible region of T\(X) (here 97% does not have any special significance, but 
we chose this merely because we found that, in the experiments the percentage of times the correct answer 
is obtained is often close to 97% if 97% credible regions of T\(X) are chosen! We could have certainly 
chosen 100(1 — a)% credible region of T±(X) for any < a < 1). 



Example 1: Forward regression 

Our first example concerns a forward problem. But we consider this foward regression problem mainly to 
contrast it with later examples on inverse regression problems. In this example, the data actually comes 
from a Geometric distribution but has been modeled in reality as involving the Poisson distribution. In 
other words, given x±, ■ ■ ■ , xio, which are drawn randomly from Uniform(l, 2), data yi ~ Geometric(pi) , 
where pi = 1/(1 + 6xi). It is assumed, for purposes of illustration, that the data has been modeled 
as Poisson(Oxi). A uniform improper prior has been put on 9, that is, ir(9) = 1; 6 > 0. Note that, 
had fji been Poisson(9xi), then E(yi) = 9x; L = Var(yi). But for the Geometric case, E(yi) = 9xi but 
Var(yi) = E(yi)(l + 9xi). In this example x% 6 (1,2) and 9 > 0. Since Xi are bounded, for 9 close to 
zero Var(yi) ~ E{yi) and we can expect Poisson and Geometric distributions to agree. However, if 9 is 
large, then Var(yi) >> E(yi) and the two distributions are expected to disagree. We considered 1000 
simulations from the Geometric distributions with the above set-up with different values of 9 and applied 
our methodology in each case to assess the goodness-of-fit of the Poisson model to the Geometric data. 
Subsequently the true model has also been applied on the data to contrast with the fit achieved by the 
Poisson model. The results are given in Table S-l. For example, for 9 = 0.1 and the true model is Geo- 
metric, the erroneous Poisson model is accepted 97% times (false positive) and the true Geometric model 
is rejected 0.3% times (false negative). In Example 2 we will contrast this with the inverse case. 

Observe that, as 9 increases, the Poisson model agrees less and less with the Geometric model. This 
is because the mean and the variance of the Geometric distribution drift apart as 9 increases. In fact, 
the percentages of agreement by the Poisson model decrease quite fast. It will be pointed out that in 
the inverse case that the decrease is relatively slow in comparison. Note that when the Geometric model 
is applied to the data it fits the data very well in all the cases. This is to be expected since it is the 
true model. In the inverse case it will be seen that the percentages of agreement by the Geometric model 
are comparatively slightly less. It will be argued that at the cost of other theoretical and computational 
advantages, the inverse model checking approach may have slightly less power compared to the forward 
approach. 



Example 2: Inverse regression 

In the first example we considered a problem involving the Geometric distribution as the true model but 
modeled as Poisson distribution. There assessment of the model fit used pairs {yj,7r(yj | X, Ylj)}. In this 
example we consider the same problem but now we focus on the pairs {xi,n(xi \ X_i,Y)} instead. We 
remind the reader that herein lies our interest. In contrast to the previous forward example, here we need 
to put a prior on X{ (in addition to the prior on 9, which we assume the same as in the previous example). 
We put the correct prior on xf, that is X{ ~ Uniform(l,2) (recall that in Example 1 Xi has been drawn 
randomly from Uniform(l,2)). 

We provide in Table S-2 abridged results of varying the parameter 6. The table clearly shows that 
Poisson disagrees more and more with the Geometric model as 9 increases, and thus the difference between 
mean and variance of the Geometric model, increases. In other words, the percentage of false positives 
decreases fast as 9 increases. On the other hand, the percentage of false negatives do not show any 
appreciable change with 9. It is important to note that, in this example, we have used the true prior for 
X, but as mentioned before this has no implication on the model adequacy test of our proposed inverse 
approach; it does not accept the model when it is false. In other words, even though the prior on X is 
assumed to be correct, our proposal correctly rejected the model corresponding to the incorrect probability 
distribution of Y and correctly accepted the model whenever the probability distribution of Y is sufficiently 
close to the true probability model. But here one must note the contrast between Table S-l, of the forward 
case, and Table S-2, corresponding to the inverse case. In the former table the percentage of disagreement 
of the Poisson model with the Geometric data increases slightly faster than in the table corresponding 
to the inverse case. Also, the percentages of agreement of the Geometric model with the true Geometric 
data is slightly higher in the forward case. These observations indicate that the power of the test with the 
inverse approach is slightly less than that with the forward approach. This is because, with the inverse 
approach, a prior on X is used, which is generally weaker than the probability model of Y, which is used 
to compute ir(Y \ X, Y) in the forward approach. However, slightly less power is not to be interpreted as 
a major drawback of our approach. Certainly, the results with the inverse approach very clearly assert the 



reliability of our proposal, in spite of slightly less power. Moreover, we have already discussed in detail 
in our main manuscript Bhattacharya (2012) that the inverse approach has a solid theoretical framework 
and nice computational properties as compared to the other available approaches. We also believe that 
with informative priors on X, constructed from observed X using empirical Bayes analysis, will make up 
for the slight loss of power. 

We now introduce an example to check the prior assumption on X in an inverse problem. In particular 
we demonstrate that, when prior information about X is available, then using a non- informative prior for 
X is inefficient for model-checking purpose. 

Example 3: Inverse regression — implications of prior assumptions on X 

For i = 1, ■ ■ ■ , 10, data y, come from the true model Poisson{6xi). Data X = {xf, i = 1, ■ ■ ■ , 10} are drawn 
randomly from an exponential distribution with mean A. In other words, the true prior for X is given by 



n(X) = f[n( 



10 

where ir(xi) are iid exponential with mean A. The parameter 8 is selected randomly from the interval 



(0,1). 

Given the above set up we now assume that it is known to us that yi ~ Poisson(9xi), but that the 
prior distribution of X is unknown. We test whether a uniform improper prior is appropriate for X. 

We evaluate our approach with several different true values of A. Note that for an exponential distribu- 
tion with mean A, the variance is A 2 ; since uniform improper priors can be said to have infinite variance, we 
can expect the fitted model to agree with the true model when A is large and disagree when A is small. We 
summarise our findings in Table S-3. Very clearly, the results are in keeping with our expectations. Unless 
the true prior for X is reasonably flat, the assumed uniform improper prior is not very appropriate. In 
fact, the conclusions drawn from this example are in agreement with the power issue discussed in the first 
two examples. From the current example it is clear that a properly elicited informative prior, which can 
be thought of as a good representative of the true prior, can improve the power of the inverse approach. 
We remark here that a prior for X estimated using observed X and principles of empirical Bayes analysis 
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is likely to approximate the true prior very accurately and hence will be far more appropriate than the 
uniform improper prior used. It has already been argued why using observed X to construct the prior for 
X makes sense. 

Example 4: Variable selection 

In addition to the above three examples, we have also conducted a variable selection study, assuming the 
true model to be Poisson with mean = 8\Xi + 02xf. Here the true values of 0\ and 62 are 0.5 and the x\ 
were drawn randomly from Uniform(0, 10). As in the previous examples, here also we will present results 
based on T\(X) and T\(X) only. 

Given the above set up, we consider three cases: (a) 6 = OiXf, (b) 6 = 8\X{ + 02x\ and (c) = 
9\Xi + 02xf + 03xf. Clearly, except (b), others are incorrect. 

For each of the three models (a), (b) and (c), with the simulation procedure repeated 1000 times, 
we implement our approach based on T\{X) and T\{X) by simulating from the leave-one-out posteriors 
tt(x I X_i,Y), corresponding to uniform priors for all variables. Case (b) was adjudged the correct model 
95% times, cases (a) and (c) agreed with the true model 39% and 84% times respectively. 

It is not at all surprising that (c) turns out to be far better than (a); this is because (a) wrongly assumes 
that 62 = but (c) does not neglect the quadratic term. In fact, in addition, (c) considers an extra cubic 
term. Noting that the true model (b) can be written as 8 = 0\Xi + 02^1 + x x 3 , the true value of 0s in 
(c) can be said to to be zero. 

We remark that obtaining realisations from the leave-one-out posteriors is simple in the simple examples 
provided. However, this is certainly not a simple exercise in the case of the real examples reported in Section 
7 of our main manuscript, Section S-5, and Section S-6. In those cases IRMCMC is clearly necessary. 

We next discuss the use of reference distributions in detecting overfitting in models. In particular we 
demonstrate that even when the observed discrepancy measure is small, this does not necessarily lead 
to acceptance of the model in question. In such cases, basing decisions solely on the smallness of the 
magnitudes of the observed discrepancy measures may be quite misleading. Below we illustrate this with 
an example. 
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Example 5: Overfit in inverse regression 

Given 9 and xf, i = 1, ■ ■ ■ ,10 which arise from a uniform distribution, suppose that m ~ Poisson(0Xi). 
But suppose that yi has been modeled as a Geometric distribution with parameter pi = 1/(1 + 9xi). Here 
although the expected value of yi under both the models is the same, given by 9xi, the variance under 
the Geometric model given by 9xi{l + Oxi) is greater than that in the Poisson case, where it is given by 
9x{. Thus, for certain values of 9 the Geometric model may overfit the data which actually comes from the 
Poisson model. Figure S-l presents a case with 9 = 15. In this case the discrepancy measure is too small 
with respect to the reference distribution. Thus the Geometric model is to be considered a poor fit to the 
observed Poisson data. Bhattacharya (2004) (see chapters 7 and 9) discusses two real cases of overfitting. 

S-5 Improving palaeoclimate model by modelling response surface as 
a mixture of Gaussian curves 

In Section 7.2 of Bhattacharya (2012) it is shown that the model of Vasko et al. (2000) does not fit the data. 
Further investigation using exploratory data analysis suggested that the unimodal model to relate species 
to environment may not be appropriate. In fact, in the palaeoclimate literature this unimodal model has 
been criticised on the ground that each species may have multiple climate preferences. Also, some species 
may represent an entire genus consisting of many sub-species, where each sub-species may have different 
climate preferences. So, even if the response curve for each sub-species is unimodal, the response curve for 
the genus of species is certainly not unimodal. For general discussion on this, see Haslett et al. (2006). 

In order to obtain an improved version of the model of Vasko et al. (2000) that takes into account the 
multimodal nature of response curves, Bhattacharya (2006) introduced a novel approach based on Dirichlet 
process to model a very flexible class of multimodal models to relate species to environment. We begin to 
describe his approach by defining, as an analogue of (17) of Bhattacharya (2012), the following 



1 

A*;fc = y^Rkj ^ ex P 

pi IkjV^r 



%i Pkj 
Ikj 



(6) 
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where for k = 1, . . . , m, r k is a discrete random variable taking values between 1 and R k + (both inclusive), 
where given fixed r k , R k+ = Y^jLi R kj- 

The above implies that the response function for the k species given by (6) is a mixture of Gaussian 
densities; the number of mixture components, denoted by r k , being unknown and hence regarded as a 
random variable. Certainly, it also includes the unimodal model as a special case, when the components of 
the mixture are all equal. Thus the response function is a multimodal function, with the number of modes 
(and indeed the magnitudes of the modes, /3k, and the scales, jk) being unknown. 

Equation (6) can be re-written as 

A^E^rexpLf^^V 



I / ./■ . — ,■■")/. : \ 

(7) 



Unlike in the case of (6), where the number of parameters is variable, in (7) the number of parameters 
{(Pkji7kj)}i<j<R k+ ;i<k<m is fixed. Below we show how (7) can be looked upon as analogous to (6) under 
appropriate modelling assumptions involving Dirichlet process. 

S-5.1 Modelling response surfaces using Dirichlet process 

Bhattacharya (2006) assume that for each k, the parameters 9 k i, ■ ■ ■ ,0kR k+ are samples from some prior 
distribution (?&(■) onSx 3ft + , where G k ~ T>(aGo) is a Dirichlet process defined by a, a positive scalar, 
and prior expectation Go(-), a specified bivariate distribution function over 3ft x ?R. + . In other words, 

[0fci, ■ ■ ■ , QkR k+ | G k ] ~ iid G k for k = 1, . . . , m 



and for each k, Gk ~ V(aGo); Gk are assumed to be independent. 

A crucial feature of the above modelling style concerns the discreteness of the prior distribution Gk, given 
the assumption of Dirichlet process; that is, under these assumptions, the parameters 9kj are coincident 
with positive probability. This is the property that Bhattacharya (2006) exploits to show that (7) boils 
down to (6) under the above modelling assumptions. The main points regarding this are sketched below. 
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Marginalisation over G k yields 

[8kj I &ki, ■ ■ ■ ,9k,j-i,8k,j+i, ■ ■ ■ ,8kR k+ ] ~ ®aR k -iGo{9kj) + a R k+ ~i 2_^ ^e kl {&kj) (8) 

In the above, Sg kl (-) denotes a unit point mass at 9^1 and a,j = l/(a + j) for positive integers j. 

The above expression shows that the dkj follow a general Polya urn scheme, that is, the joint distribution 
of {0 fc i, . . . , kRk+ } is given by the following: 6 kl ~ G , and, for j = 2, . . . , R k+ , [6 kj \ 6 k i, ■ ■ ■ , 0fcj-i] ~ 
aaj-iGo(6kj) + aj-iX^f=i ^6 kt (^kj)- Thus, given a sample {#&!, . . . ,9k,j-i}, @kj is drawn from Go with 
probability aaj_i and is otherwise drawn uniformly from among the sample {Oki, ■ ■ ■ ,&k,j-i}- I n the 
former case, Okj is a new, distinct realisation and in the latter case, it coincides with one of the realisations 
already obtained. Thus, there is a positive probability of coincident values. For more on the relationship 
between a generalized Polya urn scheme and the Dirichlet process prior, see Blackwell and McQueen (1973) 
and Ferguson (1974). 

Now, supposing that a sample from the joint distribution of O^i, • • • , &kR k+ yields r k distinct realisations 
given by 9^, . . . , Q* kr , and if R k j denotes the number of times 0L appears in the sample, then R k i + . . . + 
Rkr k = Rk+- Hence, (7) reduces to (6). 

We remark that the prior for r k is implicitly induced with this modelling style; for more details, see 
Antoniak (1974), Escobar and West (1995). 

S-5.2 Choice of G 

To complete the Bayesian model description, it is necessary to specify the prior mean Gq(-) of G(-). 
Bhattacharya (2006) assume that under Go("))7fcj ~ 7(7(11,30), an inverse-gamma prior with mean 3 and 
variance 1, and [j3kj | jkj] ~ iV(11.19, ^g^7fcj 2 )- Note that, since E 2 (~fkj) = 9, very roughly, Var((3kj) ~ 
2.45, which roughly corresponds to the prior of Vasko el al. (2000). We need to specify a value of a. In 
order to do this reasonably, we adopt the following elicitation arguments. Note that the value of a is the 
one that approximately (in a subjective sense) optimises the trade off between unmodal and multimodal 
components, keeping in mind that a priori we expect the response surface to be multimodal. In other words, 
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it is necessary to reflect this "optimism" about multimodality into the prior for the numer of components. 
With a = 10 and the maximum number of components, denoted by R^ + = 10 for each k, the probability 
of obtaining a distinct component is a/(a + Rk+ — 1) = 0.53, which is slightly more than the probability of 
obtaining a non-distinct component, the probability of the latter being 0.47. Thus, a priori, our optimism 
about single component is slightly less than that about multiple components. This seems reasonable, given 
prior palaeoclimatological knowledge. With this choice Bhattacharya (2006) found that a posteriori the 
number of components of each species was less than 10. 

S-5.3 Results of model assessment using IRD 

In the posterior analysis, the number of components for each species was found to be greater than one 
with high probability, confirming that indeed the unimodal model for relating species to environment 
is inappropriate. From the cross-validation exercise with IRMCMC it was found that for this flexible 
multimodal modeling approach of Bhattacharya (2006) 82% of the observed values fell within 95% highest 
posterior density regions. This is a significant improvement over the unimodal modeling approach of Vasko 
et al. (2000) where more than 40% observed values were excluded from the 95% highest posterior density 
regions. However, the goodness of fit test as described in this work was not satisfied, showing that there is 
scope to further improve the model. We reserve as future research the task of further improving the model 
until it satisfies the goodness of fit test. Next we provide brief details of another much more complicated 
palaeoclimate problem. 

S-6 Brief discussion of goodness of fit test of a more difficult palaeocli- 
mate problem 

Bhattacharya and Haslett (2004) provide inverse cross-validation analysis of the much more complicated 
palaeoclimate model of Haslett et al. (2006). In that case, pollen data was used, rather than chironomid 
data and each cross-validation posterior involved 2 climate variables, 14 species of pollen and about 10,000 
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parameters. In all, there were 7815 cross-validation densities since there are as many cases. Brute force 
MCMC implementation to explore all the 7815 cross-validation densities is expected to take about 5 years, 
and this using highly sophisticated parallel computing architecture. However, using IRMCMC, the entire 
exercise was completed in less than 8 hours. For details on the implementation, see Bhattacharya and 
Haslett (2004). Considering each climate variable singly, it was observed that the 95% HPD credible 
regions of both the climate variables were rather large. Hence, although in about 92% and 97% cases the 
true values were included within the respective HPD credible intervals of the two climate variables, our 
goodness of fit test applied individually to the two climate variables indicated that the model overfitted 
the data. One of the reasons that the model overfitted the data can be attributed to the presence of such 
a large number of unknown parameters in the model and the use of vague priors. Moreover, a very large 
number of the cross-validation densities turned out to be highly multimodal. Bhattacharya and Haslett 
(2007) demonstrate by simulation study that lack of homogeneity between different species is the reason 
for the multimodalities. In other words, since different species have different preferences for climate, the 
densities of climate variables were forced to be multimodal. Clearly, such multimodalities, which result 
due to clashing information, increase the uncertainty about climate variables. For complete details on the 
assessment of fit of this pollen based palaeoclimate model, see Bhattacharya (2004). 
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Table S-l: Forward problem: assessment of Poisson model fit when the true model is Geometric. 



Parameter(#) 


Poisson agreement (%) 


Geometric agreement (%) 


0.1 


97.0 


99.7 


1.0 


59.7 


99.0 


3.0 


13.2 


98.8 


5.0 


3.3 


98.5 


7.0 


0.8 


99.1 


15.0 


0.0 


97.5 



Table S-2: Inverse problem: assessment of Poisson model fit when the true model is Geometric. 



Parameter(#) 


Poisson agreement (%) 


Geometric agreement (%) 


0.1 


97.3 


97.3 


1.0 


89.3 


97.1 


3.0 


63.5 


97.5 


5.0 


34.7 


97.7 


7.0 


18.4 


97.8 


15.0 


1.4 


97.6 



Table S-3: Assessment of inverse model fit. 



Exponential mean(A) 


Agreement percentage 


0.5 


56.0 


1.00 


74.4 


3.00 


90.4 


10.00 


95.4 
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Figure S-l: Demonstration of overfitted situation in an inverse problem involving Poisson and Geometric 
model. Here considering solely the observed discrepancy measure (denoted by the vertical line) wrongly 
leads to acceptance of the overfitted model; considering it with respect to the reference distribution leads 
to the correct decision (that is rejection of the model). 
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